ABSTRACT

A sequence of empirical Bayes estimators is defined for
estimating, in a two-sample problem, the probability that X ≤ Y.
The sequence is shown to be asymptotically optimal relative to a
Ferguson Dirichlet process prior.


1. INTRODUCTION

Let X and Y be two real-valued independent random variables
with distribution functions F and G, respectively. We consider
the problem of estimating the probability that X ≤ Y, denoted by $\Delta$,

$$\Delta = \int F\,dG.$$
This problem was recently treated by Ferguson (1973), who used
a nonparametric Bayesian approach based on the Dirichlet process.
Ferguson's nonparametric Bayes estimator of $\Delta$ is as follows. [For
Dirichlet process definitions see Section 2, and for more details,
see Ferguson's (1973) paper.] To estimate $\Delta$, Ferguson lets
$X_1, \ldots, X_{n_1}$ be a sample from F, where it is assumed F is a random
distribution function chosen by a Dirichlet process $P_1$ with parameter
$\alpha_1$. Furthermore, $Y_1, \ldots, Y_{n_2}$ is a sample from G, where G is
chosen by a Dirichlet process $P_2$ with parameter $\alpha_2$, and $P_1$ and $P_2$
are independent. For squared error loss, Ferguson's Bayes estimator
of $\Delta$ is given by

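$$
\begin{aligned}
\Delta^* ={}& p_{1,n_1}\, p_{2,n_2}\, \Delta_0
 + p_{1,n_1}(1 - p_{2,n_2})\, \frac{1}{n_2}\sum_{j=1}^{n_2} F_0(Y_j) \\
&+ (1 - p_{1,n_1})\, p_{2,n_2}\, \frac{1}{n_1}\sum_{i=1}^{n_1} \bigl(1 - G_0(X_i^-)\bigr)
 + (1 - p_{1,n_1})(1 - p_{2,n_2})\, \frac{U}{n_1 n_2},
\end{aligned}
\tag{1.1}
$$

where

$$p_{1,n_1} = \frac{\alpha_1(R)}{\alpha_1(R) + n_1}, \qquad p_{2,n_2} = \frac{\alpha_2(R)}{\alpha_2(R) + n_2}, \tag{1.2}$$

$$F_0(x) = \alpha_1((-\infty, x])/\alpha_1(R), \qquad G_0(y) = \alpha_2((-\infty, y])/\alpha_2(R), \tag{1.3}$$

$$\Delta_0 = \int F_0\, dG_0, \tag{1.4}$$

R is the real line, and U, the number of pairs $(X_i, Y_j)$ for which
$X_i \le Y_j$,

$$U = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I_{(-\infty, Y_j]}(X_i), \tag{1.5}$$

is the Mann-Whitney statistic. Here

$$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases} \tag{1.6}$$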
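
As an illustration, the four-term mixture (1.1) can be computed directly from the two samples. The Python sketch below is not taken from Ferguson's paper: the function name `ferguson_bayes_estimate` and its argument names are hypothetical, and it assumes the caller supplies the prior guess $F_0$, the left limit $G_0(x^-)$, and $\Delta_0$ computed from the prior parameters.

```python
from typing import Callable, Sequence


def ferguson_bayes_estimate(
    x: Sequence[float],
    y: Sequence[float],
    a1: float,                          # alpha_1(R), total prior mass for F
    a2: float,                          # alpha_2(R), total prior mass for G
    F0: Callable[[float], float],       # prior guess F_0 as in (1.3)
    G0_left: Callable[[float], float],  # left limit G_0(x-) of the prior guess for G
    delta0: float,                      # Delta_0 as in (1.4)
) -> float:
    """Evaluate the mixture (1.1) for samples x (from F) and y (from G)."""
    n1, n2 = len(x), len(y)
    p1 = a1 / (a1 + n1)                 # weights from (1.2)
    p2 = a2 / (a2 + n2)
    mean_F0_y = sum(F0(v) for v in y) / n2
    mean_tail_x = sum(1.0 - G0_left(u) for u in x) / n1
    # Mann-Whitney count (1.5): number of pairs with x_i <= y_j
    U = sum(1 for u in x for v in y if u <= v)
    return (
        p1 * p2 * delta0
        + p1 * (1 - p2) * mean_F0_y
        + (1 - p1) * p2 * mean_tail_x
        + (1 - p1) * (1 - p2) * U / (n1 * n2)
    )


def uniform_cdf(t: float) -> float:
    """CDF of Uniform(0, 1); continuous, so it equals its own left limit."""
    return min(max(t, 0.0), 1.0)
```

With uniform prior guesses for both F and G (so $\Delta_0 = 1/2$) and $\alpha_1(R) = \alpha_2(R) = 2$, the call `ferguson_bayes_estimate([0.2, 0.6], [0.5, 0.9], 2.0, 2.0, uniform_cdf, uniform_cdf, 0.5)` weights each of the four terms by 1/4. Letting both total masses shrink toward zero leaves only the last term, $U/(n_1 n_2)$.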


Ferguson notes that the estimator $\Delta^*$ is a simple mixture of four
separate estimators of $\Delta$. As both $\alpha_1(R)$ and $\alpha_2(R)$ tend to zero,
$\Delta^*$ converges to $(n_1 n_2)^{-1} U$, the usual nonparametric estimator.

Motivated by Ferguson's $\Delta^*$, we propose an empirical Bayes
estimator of $\Delta$ which requires less prior information about $\alpha_1(\cdot)$
and $\alpha_2(\cdot)$. Only $\alpha_1(R)$ and $\alpha_2(R)$ need be specified. Consider,
then, the following setup appropriate for an empirical Bayes
estimation problem. Let $\{X_i^{(1)}\}$, $\{X_i^{(2)}\}$, $i = 1, 2, \ldots$, be two
independent sequences of independent vectors of observations from
F and G respectively. Here $X_i^{(j)} = (X_{i1}^{(j)}, \ldots, X_{in_j}^{(j)})$, $j = 1, 2$ and
$i = 1, 2, \ldots$. Assume independent Dirichlet priors on $(R, \mathcal{B})$
with parameters $\alpha_1$ and $\alpha_2$ respectively for F and G. Here R is the
real line and $\mathcal{B}$ is the $\sigma$-field of Borel subsets of R. Let the
action space be the closed interval [0, 1], and the loss function
be

$$L(\Delta, \hat\Delta) = (\Delta - \hat\Delta)^2, \tag{1.7}$$

where $\hat\Delta$ is an estimator of $\Delta$. We assume $\alpha_1(R)$ and $\alpha_2(R)$ are known.


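We then propose the estimator $\hat\Delta_n$ below as an estimator of $\Delta$
on the (n+1)th occasion. The estimator is given by

$$
\begin{aligned}
\hat\Delta_n ={}& \hat\Delta_n(X_1^{(1)}, \ldots, X_{n+1}^{(1)};\; X_1^{(2)}, \ldots, X_{n+1}^{(2)}) \\
={}& p_{1,n_1}\, p_{2,n_2}\, \frac{1}{n^2 n_1 n_2} \sum_{i=1}^{n}\sum_{j=1}^{n_2}\sum_{k=1}^{n}\sum_{\ell=1}^{n_1} I_{(-\infty, X_{ij}^{(2)}]}(X_{k\ell}^{(1)}) \\
&+ p_{1,n_1}(1 - p_{2,n_2})\, \frac{1}{n n_1 n_2} \sum_{j=1}^{n_2}\sum_{k=1}^{n}\sum_{\ell=1}^{n_1} \delta_{X_{k\ell}^{(1)}}\bigl((-\infty, X_{n+1,j}^{(2)}]\bigr) \\
&+ (1 - p_{1,n_1})\, p_{2,n_2}\, \frac{1}{n_1} \sum_{i=1}^{n_1} \Bigl[ 1 - \frac{1}{n n_2} \sum_{j=1}^{n}\sum_{k=1}^{n_2} \delta_{X_{jk}^{(2)}}\bigl((-\infty, X_{n+1,i}^{(1)})\bigr) \Bigr] \\
&+ (1 - p_{1,n_1})(1 - p_{2,n_2})\, \frac{1}{n_1 n_2} \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I_{(-\infty, X_{n+1,j}^{(2)}]}(X_{n+1,i}^{(1)}),
\end{aligned}
\tag{1.8}
$$

where $p_{i,n_i}$, $i = 1, 2$, are given by (1.2), and

$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases} \tag{1.9}$$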


Note that the first three terms in (1.8) are the natural
estimators of corresponding terms in the Bayes estimator based on
all the observations or only the past observations.
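
To make the structure of (1.8) concrete, the following Python sketch evaluates it by replacing the prior guesses with empirical frequencies from the pooled past samples. It is an illustrative reading of (1.8) rather than code from the paper: the function name `empirical_bayes_estimate`, the argument names, and the list-of-lists layout for the past samples are all assumptions.

```python
from typing import Sequence


def _frac_le(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Fraction of pairs (x, y) with x <= y."""
    return sum(1 for u in xs for v in ys if u <= v) / (len(xs) * len(ys))


def empirical_bayes_estimate(
    past_x: Sequence[Sequence[float]],  # n past samples, each of size n1
    past_y: Sequence[Sequence[float]],  # n past samples, each of size n2
    cur_x: Sequence[float],             # current (n+1)th sample of size n1
    cur_y: Sequence[float],             # current (n+1)th sample of size n2
    a1: float,                          # alpha_1(R), assumed known
    a2: float,                          # alpha_2(R), assumed known
) -> float:
    n1, n2 = len(cur_x), len(cur_y)
    p1 = a1 / (a1 + n1)
    p2 = a2 / (a2 + n2)
    pooled_x = [u for sample in past_x for u in sample]
    pooled_y = [v for sample in past_y for v in sample]
    t1 = _frac_le(pooled_x, pooled_y)  # past vs past: stands in for Delta_0
    t2 = _frac_le(pooled_x, cur_y)     # past X vs current Y: stands in for mean F_0(Y_j)
    # 1 - (fraction of past Y strictly below x) = fraction of past Y with x <= Y
    t3 = _frac_le(cur_x, pooled_y)     # stands in for mean of 1 - G_0(X_i-)
    t4 = _frac_le(cur_x, cur_y)        # Mann-Whitney term U / (n1 * n2)
    return (
        p1 * p2 * t1
        + p1 * (1 - p2) * t2
        + (1 - p1) * p2 * t3
        + (1 - p1) * (1 - p2) * t4
    )
```

Only the total masses `a1` and `a2` enter the computation; the shapes of the prior parameters are never needed, which is the point of the empirical Bayes construction.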

In Section 3 we prove that the sequence $D = \{\hat\Delta_n\}$ is
asymptotically optimal in the sense of Robbins (1964). Thus even though
one need only specify $\alpha_1(R)$ and $\alpha_2(R)$, the procedure is
asymptotically as good as though $\alpha_1(\cdot)$ and $\alpha_2(\cdot)$ were known exactly.

Empirical Bayes methods, based on the Dirichlet process, have
also been used by Antoniak (1974) for a model based on mixtures
of Dirichlet processes, by Korwar and Hollander (1976) for estimating
the mean of a distribution, by Korwar and Hollander (1976) and
Hollander and Korwar (1976) for estimating a distribution function,
and by Susarla and Van Ryzin (1976) for estimating a distribution
function when the observations are censored on the right.

2.  DIRICHLET  PROCESS  PRELIMINARIES 

In  this  section  we  present  some  Dirichlet  process  definitions 
and  results  that  will  be  used  in  our  proof  of  asymptotic  optimality. 
See  Ferguson  (1973) , (1974)  for  more  comprehensive  coverage  of 
results  pertaining  to  the  Dirichlet  process. 

Definition 2.1 (Ferguson, 1973). Let $Z_1, \ldots, Z_k$ be independent
random variables with $Z_j$ having a gamma distribution with shape
parameter $\alpha_j \ge 0$ and scale parameter 1, $j = 1, \ldots, k$. Let $\alpha_j > 0$
for some j. The Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_k)$,
denoted by $D(\alpha_1, \ldots, \alpha_k)$, is defined as the distribution of
$(Y_1, \ldots, Y_k)$, where $Y_j = Z_j / \sum_{i=1}^{k} Z_i$, $j = 1, \ldots, k$.
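
The construction in Definition 2.1 translates directly into a sampler: draw independent gamma variables and normalize by their sum. The Python sketch below uses only the standard library; the function name `sample_dirichlet` is hypothetical, and a zero shape parameter is handled by setting the corresponding coordinate to 0, which is the degenerate case of the definition.

```python
import random
from typing import List, Sequence


def sample_dirichlet(alphas: Sequence[float], rng: random.Random) -> List[float]:
    """Draw (Y_1, ..., Y_k) by normalizing independent gamma variables."""
    if any(a < 0 for a in alphas) or not any(a > 0 for a in alphas):
        raise ValueError("need all alpha_j >= 0 and at least one alpha_j > 0")
    # Z_j ~ Gamma(shape=alpha_j, scale=1); a zero shape gives Z_j = 0
    z = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(z)
    return [v / total for v in z]
```

For example, `sample_dirichlet([2.0, 0.0, 3.0], random.Random(0))` returns a vector whose coordinates sum to 1 and whose second coordinate is exactly 0.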

This distribution is always singular with respect to Lebesgue
measure on k-dimensional Euclidean space. Also, if any $\alpha_j = 0$,
the corresponding $Y_j$ is degenerate at 0. However, if $\alpha_j > 0$ for
all $j = 1, \ldots, k$, the (k-1)-dimensional distribution of $(Y_1, \ldots, Y_{k-1})$
has density, with respect to Lebesgue measure on the (k-1)-dimensional
Euclidean space, given by

$$
f(y_1, \ldots, y_{k-1} \mid \alpha_1, \ldots, \alpha_k)
= \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)}
\Bigl( \prod_{i=1}^{k-1} y_i^{\alpha_i - 1} \Bigr)
\Bigl( 1 - \sum_{i=1}^{k-1} y_i \Bigr)^{\alpha_k - 1}
I_S(y_1, \ldots, y_{k-1}),
\tag{2.1}
$$

where S is the simplex

$$S = \Bigl\{ (y_1, \ldots, y_{k-1}) : y_i \ge 0, \; \sum_{i=1}^{k-1} y_i \le 1 \Bigr\}.$$

For k = 2, (2.1) becomes the density of a Beta distribution,
$Be(\alpha_1, \alpha_2)$. Note that the condition $\alpha_i > 0$ for some $i = 1, \ldots, k$
is required in Definition 2.1 so that $\sum_{i=1}^{k} Z_i$ is not degenerate at 0.

Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. Ferguson defined the following
stochastic process $\{P(A), A \in \mathcal{A}\}$.

Definition 2.2 (Ferguson, 1973). Let $(\mathcal{X}, \mathcal{A})$ be a measurable space.
Let $\alpha$ be a non-null finite measure (nonnegative and finitely
additive) on $(\mathcal{X}, \mathcal{A})$. Then P is a Dirichlet process on $(\mathcal{X}, \mathcal{A})$ with
parameter $\alpha$ if for every $k = 1, 2, \ldots$, and measurable partition
$(B_1, \ldots, B_k)$ of $\mathcal{X}$, the distribution of $(P(B_1), \ldots, P(B_k))$ is Dirichlet
with parameter $(\alpha(B_1), \ldots, \alpha(B_k))$.

If  F is  chosen  by  a Dirichlet  process,  then  F is  discrete  with 
probability  one  [see  Ferguson  (1973),  Berk  and  Savage  (1975), 
Blackwell  (1973),  and  Blackwell  and  MacQueen  (1973)]. 

A sample  from  a Dirichlet  process  is  next  defined. 


Definition 2.3 (Ferguson, 1973). The $\mathcal{X}$-valued random variables
$X_1, \ldots, X_m$ constitute a sample of size m from a Dirichlet process
P on $(\mathcal{X}, \mathcal{A})$ with parameter $\alpha$ if for any $\ell = 1, 2, \ldots$ and measurable
sets $A_1, \ldots, A_\ell$, $C_1, \ldots, C_m$,

$$Q\{X_1 \in C_1, \ldots, X_m \in C_m \mid P(A_1), \ldots, P(A_\ell), P(C_1), \ldots, P(C_m)\} = \prod_{i=1}^{m} P(C_i) \quad \text{a.s.},$$

where Q denotes probability.

Roughly speaking, we may view a sample of size m from a
Dirichlet process as follows. The process chooses a random
distribution F, say, and then given F, $X_1, \ldots, X_m$ is a random sample from
F.
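
One concrete way to generate such a sample without constructing the random F explicitly is the urn scheme of Blackwell and MacQueen (1973): the next observation is a fresh draw from the normalized parameter with probability proportional to the total mass, and otherwise repeats one of the earlier observations chosen uniformly. The Python sketch below is an illustration under the assumption that the parameter is `total_mass` times a base distribution with a user-supplied sampler; the function and argument names are hypothetical.

```python
import random
from typing import Callable, List


def sample_from_dirichlet_process(
    m: int,
    total_mass: float,                               # alpha(X), the total mass of the parameter
    base_sampler: Callable[[random.Random], float],  # draws from alpha(.) / alpha(X)
    rng: random.Random,
) -> List[float]:
    """Draw X_1, ..., X_m sequentially via the Blackwell-MacQueen urn."""
    xs: List[float] = []
    for k in range(m):
        # After k draws: fresh value with probability total_mass / (total_mass + k),
        # otherwise repeat one of the k earlier values chosen uniformly at random.
        if rng.random() < total_mass / (total_mass + k):
            xs.append(base_sampler(rng))
        else:
            xs.append(rng.choice(xs))
    return xs
```

Because earlier values are repeated with positive probability, the output typically contains ties, which reflects the fact that a distribution chosen by a Dirichlet process is discrete with probability one.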

Theorem 2.4 gives the posterior distribution of a Dirichlet
process P, given a sample $X_1, \ldots, X_m$ from the process.

Theorem 2.4 (Ferguson, 1973). Let P be a Dirichlet process on
$(\mathcal{X}, \mathcal{A})$ with parameter $\alpha$, and let $X_1, \ldots, X_m$ be a sample of size m
from P. Then the conditional distribution of P given $X_1, \ldots, X_m$
is a Dirichlet process on $(\mathcal{X}, \mathcal{A})$ with parameter $\beta = \alpha + \sum_{i=1}^{m} \delta_{X_i}$,
where, for $x \in \mathcal{X}$, $A \in \mathcal{A}$, $\delta_x(A)$ is given by (1.9).

Theorem  2.5  is  a generalization  of  Ferguson's  (1973) 
Proposition  4. 

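Theorem 2.5. Let P be a Dirichlet process on $(R, \mathcal{B})$ with parameter
$\alpha$ and let $X_1, \ldots, X_m$ be a sample of size m from P. Then

$$
Q\{X_1 \le x_1, \ldots, X_m \le x_m\}
= \frac{\alpha(A_{x_{(1)}})\,(\alpha(A_{x_{(2)}}) + 1) \cdots (\alpha(A_{x_{(m)}}) + m - 1)}
       {\alpha(R)\,(\alpha(R) + 1) \cdots (\alpha(R) + m - 1)},
$$

where $x_{(1)} \le \cdots \le x_{(m)}$ is an arrangement of $x_1, \ldots, x_m$ in increasing
order of magnitude, $A_x = (-\infty, x]$, and Q denotes probability.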
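
The formula of Theorem 2.5 is a product of ratios indexed by the ordered arguments, so it is easy to evaluate exactly. The Python sketch below does this with rational arithmetic; the function name `joint_cdf_probability` and the callable `alpha_mass` (which returns the parameter's mass on $(-\infty, x]$) are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Callable, Sequence


def joint_cdf_probability(
    xs: Sequence[Fraction],
    alpha_mass: Callable[[Fraction], Fraction],  # x -> alpha((-inf, x])
    total_mass: Fraction,                        # alpha(R)
) -> Fraction:
    """Evaluate Q{X_1 <= x_1, ..., X_m <= x_m} using the formula of Theorem 2.5."""
    num = Fraction(1)
    den = Fraction(1)
    # The k-th smallest argument (k = 0, 1, ...) contributes alpha(A_x) + k
    # to the numerator and alpha(R) + k to the denominator.
    for k, x in enumerate(sorted(xs)):
        num *= alpha_mass(x) + k
        den *= total_mass + k
    return num / den


def uniform_alpha_mass(x: Fraction) -> Fraction:
    """Example parameter: 2 times the Uniform(0, 1) distribution."""
    return 2 * min(max(x, Fraction(0)), Fraction(1))
```

For the example parameter, `joint_cdf_probability([Fraction(1, 2), Fraction(1, 4)], uniform_alpha_mass, Fraction(2))` gives 1/6. This agrees with a direct calculation: the second observation repeats the first with probability 1/3 and is a fresh uniform draw with probability 2/3, so the probability is (1/3)(1/4) + (2/3)(1/2)(1/4) = 1/6.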

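Proof. Observe that $A_{x_{(k)}} \subset A_{x_{(k+1)}}$, $k = 1, \ldots, m-1$. We write

$$
\begin{aligned}
A_{x_{(1)}} &= B_1, \\
A_{x_{(k)}} &= A_{x_{(1)}} + \bar{A}_{x_{(1)}} A_{x_{(2)}} + \cdots + \bar{A}_{x_{(k-1)}} A_{x_{(k)}}
             = B_1 + B_2 + \cdots + B_k, \quad k = 2, \ldots, m, \\
\bar{A}_{x_{(m)}} &= B_{m+1}.
\end{aligned}
$$

Here $\bar{A}$ denotes the complement of the set A. Then

$$Q\{X_1 \le x_1, \ldots, X_m \le x_m\} = Q\{X_1 \in A_{x_{(i_1)}}, \ldots, X_m \in A_{x_{(i_m)}}\},$$

where $(i_1, \ldots, i_m)$ is a permutation of $(1, \ldots, m)$. Now

$$
\begin{aligned}
Q\{X_1 \in A_{x_{(i_1)}}, \ldots, X_m \in A_{x_{(i_m)}}\}
&= E\bigl[ Q\{X_1 \in A_{x_{(i_1)}}, \ldots, X_m \in A_{x_{(i_m)}} \mid P(A_{x_{(i_1)}}), \ldots, P(A_{x_{(i_m)}})\} \bigr] \\
&= E\{P(A_{x_{(i_1)}}) \cdots P(A_{x_{(i_m)}})\},
\end{aligned}
$$

by Definition 2.3. Since $(P(B_1), \ldots, P(B_{m+1}))$ is $D(\alpha(B_1), \ldots, \alpha(B_{m+1}))$,
using the moments of the Dirichlet distribution we can obtain

$$
\begin{aligned}
E\{P(A_{x_{(i_1)}}) \cdots P(A_{x_{(i_m)}})\}
&= E\{P(B_1)\,(P(B_1) + P(B_2)) \cdots (P(B_1) + \cdots + P(B_m))\} \\
&= \frac{\alpha(A_{x_{(1)}})\,(\alpha(A_{x_{(2)}}) + 1) \cdots (\alpha(A_{x_{(m)}}) + m - 1)}
        {\alpha(R)\,(\alpha(R) + 1) \cdots (\alpha(R) + m - 1)}.
\end{aligned}
$$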

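Theorem 2.6 (Ferguson, 1973). Let P be the Dirichlet process
(Definition 2.2) with parameter $\alpha$ and let $Z_1$ and $Z_2$ be measurable
real-valued functions defined on $(\mathcal{X}, \mathcal{A})$. If $\int |Z_1|\, d\alpha < \infty$, $\int |Z_2|\, d\alpha < \infty$
and $\int |Z_1 Z_2|\, d\alpha < \infty$, then

$$E \int Z_1\, dP \int Z_2\, dP = \{\sigma_{12} / (\alpha(\mathcal{X}) + 1)\} + \mu_1 \mu_2,$$

where

$$\mu_i = \int Z_i\, d\alpha / \alpha(\mathcal{X}), \quad i = 1, 2,$$

and

$$\sigma_{12} = \Bigl\{ \int Z_1 Z_2\, d\alpha / \alpha(\mathcal{X}) \Bigr\} - \mu_1 \mu_2.$$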


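3. ASYMPTOTIC OPTIMALITY OF $\{\hat\Delta_n\}$

We now establish the asymptotic optimality of $D = \{\hat\Delta_n\}$. In
our empirical Bayes framework, Ferguson's Bayes estimator of $\Delta$
based on $(X_{n+1}^{(1)}, X_{n+1}^{(2)})$ is

$$
\begin{aligned}
\Delta^* ={}& p_1 p_2 \Delta_0
 + p_1 (1 - p_2) \sum_{j=1}^{n_2} F_0(X_{n+1,j}^{(2)}) / n_2
 + (1 - p_1) p_2 \sum_{i=1}^{n_1} \bigl(1 - G_0(X_{n+1,i}^{(1)-})\bigr) / n_1 \\
&+ (1 - p_1)(1 - p_2) \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I_{(-\infty, X_{n+1,j}^{(2)}]}(X_{n+1,i}^{(1)}) / n_1 n_2,
\end{aligned}
\tag{3.1}
$$

where $p_1 = p_{1,n_1}$ and $p_2 = p_{2,n_2}$ are given by (1.2), $F_0$ and $G_0$ by
(1.3), $\Delta_0$ by (1.4) and $I_A(x)$ by (1.6). The Bayes risks $R(\Delta^*, \alpha_1, \alpha_2)$
and $R(\hat\Delta_n, \alpha_1, \alpha_2)$ of (3.1) and (1.8) respectively, with respect to
the Dirichlet priors, are

$$
R(\Delta^*, \alpha_1, \alpha_2) \overset{\mathrm{def}}{=}
E_{X_{n+1}^{(1)}, X_{n+1}^{(2)}}\, E_{F, G \mid X_{n+1}^{(1)}, X_{n+1}^{(2)}}\, (\Delta - \Delta^*)^2,
\tag{3.2}
$$

and

$$
R(\hat\Delta_n, \alpha_1, \alpha_2) =
E_{X_{n+1}^{(1)}, X_{n+1}^{(2)}}\, E_{F, G \mid X_{n+1}^{(1)}, X_{n+1}^{(2)}}\, (\Delta - \hat\Delta_n)^2.
\tag{3.3}
$$

Let $R_n(D, \alpha_1, \alpha_2)$ be the expectation of $R(\hat\Delta_n, \alpha_1, \alpha_2)$ with respect to
$X_1^{(1)}, X_1^{(2)}, \ldots, X_n^{(1)}, X_n^{(2)}$ (the past observations).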
l i n □ 

Definition 3.1. The sequence $D = \{\hat\Delta_n\}$ is said to be asymptotically
optimal relative to $(\alpha_1, \alpha_2)$ if $R_n(D, \alpha_1, \alpha_2)$ converges to the minimum
Bayes risk $R(\Delta^*, \alpha_1, \alpha_2)$, as $n \to \infty$.

Definition  3.1  of  asymptotic  optimality  is  given  here  in  the 
specific  setting  of  the  problem  under  discussion.  For  a more 
general  definition  see  Section  2 of  Robbins  (1964). 

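Theorem 3.2. Let $\alpha_1(R)$ and $\alpha_2(R)$ be known. Then

$$R(\Delta^*, \alpha_1, \alpha_2) = E\Delta^2 - E\Delta^{*2}, \tag{3.4}$$

$$R_n(D, \alpha_1, \alpha_2) = (E\Delta^2 - E\Delta^{*2}) + (E\hat\Delta_n^2 - E\Delta^{*2}), \tag{3.5}$$

where

$$
E\Delta^2 = p_{1,1}\, p_{2,1}\, \Delta_0^2 + (1 - p_{1,1})(1 - p_{2,1})\, \Delta_0
 + (1 - p_{1,1})\, p_{2,1}\, I_1 + p_{1,1}(1 - p_{2,1})\, I_2,
\tag{3.6}
$$

$$
E\Delta^{*2} = g_1 g_2 \Delta_0^2 + (1 - g_1)(1 - g_2)\, \Delta_0
 + (1 - g_1)\, g_2\, I_1 + g_1 (1 - g_2)\, I_2,
\tag{3.7}
$$

and

$$
E\hat\Delta_n^2 = h_1 h_2 \Delta_0^2 + (1 - h_1)(1 - h_2)\, \Delta_0
 + (1 - h_1)\, h_2\, I_1 + h_1 (1 - h_2)\, I_2,
\tag{3.8}
$$

$$I_1 = \iint F_0(y_{(1)})\, dG_0(y_1)\, dG_0(y_2), \tag{3.9}$$

$$I_2 = \int F_0^2(y)\, dG_0(y), \tag{3.10}$$

$$
\begin{aligned}
g_1 &= g_{1,n_1} = \frac{p_{1,1}\, p_{1,n_1}}{p_{1,n_1+1}}, &
g_2 &= g_{2,n_2} = \frac{p_{2,1}\, p_{2,n_2}}{p_{2,n_2+1}}, \\
h_1 &= h_{1,n_1} = g_1 \Bigl[ 1 - \frac{p_{1,n_1+1}}{n n_1} \Bigr], &
h_2 &= h_{2,n_2} = g_2 \Bigl[ 1 - \frac{p_{2,n_2+1}}{n n_2} \Bigr].
\end{aligned}
\tag{3.11}
$$

In particular $D = \{\hat\Delta_n\}$ is asymptotically optimal relative to $(\alpha_1, \alpha_2)$.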
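
The risk expressions of Theorem 3.2 depend on the priors only through $\Delta_0$, $I_1$, $I_2$ and the two total masses, so they are straightforward to evaluate numerically. The Python sketch below assumes the forms $g_i = p_{i,1} p_{i,n_i} / p_{i,n_i+1}$ and $h_i = g_i [1 - p_{i,n_i+1} / (n n_i)]$ for (3.11), with $p_{i,m} = \alpha_i(R) / (\alpha_i(R) + m)$ as in (1.2); all function names are hypothetical.

```python
from typing import Tuple


def p(a: float, m: int) -> float:
    """Weight a / (a + m) as in (1.2), with a the total prior mass."""
    return a / (a + m)


def g(a: float, ni: int) -> float:
    return p(a, 1) * p(a, ni) / p(a, ni + 1)


def h(a: float, ni: int, n: int) -> float:
    return g(a, ni) * (1.0 - p(a, ni + 1) / (n * ni))


def second_moment(c1: float, c2: float, delta0: float, i1: float, i2: float) -> float:
    """Common form of (3.6)-(3.8) with coefficient pair (c1, c2)."""
    return (
        c1 * c2 * delta0 ** 2
        + (1 - c1) * (1 - c2) * delta0
        + (1 - c1) * c2 * i1
        + c1 * (1 - c2) * i2
    )


def risks(a1: float, a2: float, n1: int, n2: int, n: int,
          delta0: float, i1: float, i2: float) -> Tuple[float, float]:
    """Return (minimum Bayes risk (3.4), empirical Bayes risk (3.5))."""
    e_delta2 = second_moment(p(a1, 1), p(a2, 1), delta0, i1, i2)
    e_bayes2 = second_moment(g(a1, n1), g(a2, n2), delta0, i1, i2)
    e_eb2 = second_moment(h(a1, n1, n), h(a2, n2, n), delta0, i1, i2)
    bayes_risk = e_delta2 - e_bayes2
    eb_risk = bayes_risk + (e_eb2 - e_bayes2)
    return bayes_risk, eb_risk
```

With uniform prior guesses for both F and G one has $\Delta_0 = 1/2$ and $I_1 = I_2 = 1/3$. Calling `risks(1.0, 1.0, 5, 5, n, 0.5, 1/3, 1/3)` for increasing `n` shows the excess of the empirical Bayes risk over the minimum Bayes risk roughly halving each time `n` doubles.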


Our proof of Theorem 3.2 uses the following lemma.

Lemma 3.3. Let $X_i^{(1)} = (X_{i1}^{(1)}, \ldots, X_{in_1}^{(1)})$, $i = 1, 2$, be two independent
samples, each of size $n_1$, from a Dirichlet process on $(R, \mathcal{B})$
with parameter $\alpha_1$, and let $X_i^{(2)} = (X_{i1}^{(2)}, \ldots, X_{in_2}^{(2)})$, $i = 1, 2$, be two
independent samples, each of size $n_2$, independent of $(X_1^{(1)}, X_2^{(1)})$,
from an independent Dirichlet process on $(R, \mathcal{B})$ with parameter $\alpha_2$.
Assume that each of $\alpha_1$, $\alpha_2$ is $\sigma$-additive. Then


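$$E\, I_{(-\infty, X_{ij}^{(2)}]}(X_{k\ell}^{(1)}) = E F_0(X_{ij}^{(2)}), \tag{3.12}$$

$$E\, \delta_{X_{k\ell}^{(1)}}\bigl((-\infty, X_{ij}^{(2)}]\bigr) = E F_0(X_{ij}^{(2)}), \tag{3.13}$$

$$E\, \delta_{X_{ij}^{(2)}}\bigl([X_{k\ell}^{(1)}, \infty)\bigr) = E F_0(X_{ij}^{(2)}), \tag{3.14}$$

$$E F_0(X_{ij}^{(2)}) = \Delta_0, \tag{3.15}$$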


(3.16) 


(2)  NT.  /v(2) 


W » _(2)  (^)  ^Fo(X^)F0(x^,), 


(3.17) 


E<i-°o<4t>  <Vxk£ 

<_*,X«  1 <3.18, 

EVi)(<-'xu],Vi)  • 

V-fc' 


EF0(XiJ>)F0(Xi’j,)  ’ k * k' 


{o1(R)EF0(X^))Fq(X^,)  + EF0(xJ^)}/(ai(R)  + 1),  (3.19) 


FFo<x(i)) 


k=k'  , l*V 

,Tt=k' , £=£’ 


L 


"o<x«)>Fo(!tn')  k'“'  (3 

■ 

{a1(R)EF0(X^))F0(X^^I)+EF0(xj^)>/(o1(R)+l), 

k-k'. 


ESI(2)(tXW>-*»4I(2,  <«$•*»  ' 

ij  i'j’ 


{a1(R)EF0(X^))F0(X^,)+EF0(X^j)}/(o1(R)+l), 


EVxm>- 


' . (2),(Xk£>)5  (1)  ’ 

^ >Xij  J 

eFqCx^^FqCx^,),  k*k’ 

C 

• {ai(R)EF0(X^V0(X^^  , )+EFQ(X^  ) }/  (0l(R)+l)  , 

k-kr,  l*V 


EVxa))’ 


k-k',  l-V , 


(2),(XU))5v(2)  ([Xk’£’ ,0o)) 


xv 

’ ij 


Y'"' 

Xi'j’ 


EFo(xlf)ro(xl'r,> 


k*k' , 


{o1(R)EF0(X^))F0(X^^ , )+EFq(X^)  }/  (o^RHl)  , 

k-k',  L*V 


(3.23) 


EF0(Xg) 


, k-k’,  l-V . 


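In (3.19) - (3.23), $X_{(1)}^{(2)}$ is the smaller of $X_{ij}^{(2)}$ and $X_{i'j'}^{(2)}$.
Also,

$$
E F_0(X_{(1)}^{(2)}) =
\begin{cases}
\iint F_0(y_{(1)})\, dG_0(y_1)\, dG_0(y_2), & i \ne i' \\
\int F_0(y_{(1)})\, dK_0(y_1, y_2), & i = i',
\end{cases}
\tag{3.24}
$$

$$
E F_0(X_{ij}^{(2)}) F_0(X_{i'j'}^{(2)}) =
\begin{cases}
\Delta_0^2, & i \ne i' \\
\int F_0(y_1) F_0(y_2)\, dK_0(y_1, y_2), & i = i',\; j \ne j' \\
\int F_0^2(y)\, dG_0(y), & i = i',\; j = j',
\end{cases}
\tag{3.25}
$$

where $y_{(1)}$ is the smaller of $y_1$ and $y_2$, and

$$
E \bigl(1 - G_0(X_{k\ell}^{(1)-})\bigr)\bigl(1 - G_0(X_{k'\ell'}^{(1)-})\bigr) =
\begin{cases}
\Delta_0^2, & k \ne k' \\
\int (1 - G_0(x_1^-))(1 - G_0(x_2^-))\, dH_0(x_1, x_2), & k = k',\; \ell \ne \ell' \\
\int (1 - G_0(x^-))^2\, dF_0(x), & k = k',\; \ell = \ell'.
\end{cases}
\tag{3.26}
$$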


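Remark 3.4. By using the conditional distribution

$$L_0(y_2 \mid y_1) = \frac{\alpha_2((-\infty, y_2]) + \delta_{y_1}((-\infty, y_2])}{\alpha_2(R) + 1}$$

of $X_{ij'}^{(2)}$, given $X_{ij}^{(2)}$ $(j \ne j')$, we can show that

$$\int F_0(y_{(1)})\, dK_0(y_1, y_2) = [\alpha_2(R) I_1 + \Delta_0] / (\alpha_2(R) + 1), \tag{3.27}$$

$$\int F_0(y_1) F_0(y_2)\, dK_0(y_1, y_2) = [\alpha_2(R) \Delta_0^2 + I_2] / (\alpha_2(R) + 1). \tag{3.28}$$

The function $K_0(y_1, y_2)$ is the distribution function of $(X_{ij}^{(2)}, X_{ij'}^{(2)})$,
$(j \ne j')$, and $y_{(1)}$ is $\min(y_1, y_2)$. Similarly, by considering the
conditional distribution of $X_{k\ell'}^{(1)}$, given $X_{k\ell}^{(1)}$ $(\ell \ne \ell')$, we can show

$$\int (1 - G_0(x_1^-))(1 - G_0(x_2^-))\, dH_0(x_1, x_2) = [\alpha_1(R) \Delta_0^2 + I_1] / (\alpha_1(R) + 1). \tag{3.29}$$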

**  2 

Proof of Theorem 3.2. To prove (3.4), expand $(\Delta - \Delta^*)^2$, and note
that $\Delta^*$, being the Bayes estimator with squared error loss, is
$E(\Delta \mid X_{n+1}^{(1)}, X_{n+1}^{(2)})$. To prove (3.5), expand $(\Delta - \hat\Delta_n)^2$, and note that

$$
E(\Delta \hat\Delta_n) = E\bigl(\hat\Delta_n E(\Delta \mid X_{n+1}^{(1)}, X_{n+1}^{(2)})\bigr)
= E(\hat\Delta_n \Delta^*) = E\bigl(\Delta^* E(\hat\Delta_n \mid X_{n+1}^{(1)}, X_{n+1}^{(2)})\bigr) = E\Delta^{*2}.
$$

(Here we used the fact that $E(\hat\Delta_n \mid X_{n+1}^{(1)}, X_{n+1}^{(2)}) = \Delta^*$, which is easily
verified.) To prove (3.6), use Theorem 2.6 twice. Thus,

$$
\begin{aligned}
E\Delta^2 &= E\Bigl(\int F(y)\, dG(y)\Bigr)^2 \\
&= E\Bigl(E\Bigl(\Bigl(\int F(y)\, dG(y)\Bigr)^2 \Bigm| F\Bigr)\Bigr) \\
&= E\Bigl\{ \frac{\int F^2(y)\, dG_0(y)}{\alpha_2(R) + 1} + \frac{\alpha_2(R) \bigl(\int F(y)\, dG_0(y)\bigr)^2}{\alpha_2(R) + 1} \Bigr\} \\
&= \frac{ \int F_0(y)\,(\alpha_1(R) F_0(y) + 1)\, dG_0(y) / (\alpha_1(R) + 1) + \alpha_2(R)\, E\bigl(\int (1 - G_0(x^-))\, dF(x)\bigr)^2 }{\alpha_2(R) + 1} \\
&= \frac{ (I_2\, \alpha_1(R) + \Delta_0) / (\alpha_1(R) + 1) + \alpha_2(R) \bigl\{ \int (1 - G_0(x^-))^2\, dF_0(x) + \alpha_1(R) \Delta_0^2 \bigr\} / (\alpha_1(R) + 1) }{\alpha_2(R) + 1},
\end{aligned}
$$

which is (3.6). (In the above derivation, use the second moment of
a beta distribution to get from the third equality to the fourth.)
To prove (3.7), expand $\Delta^{*2}$ into ten sums, take expectations and use
Lemma 3.3 repeatedly. For example one of the sums in the expansion
is $\sum_{j=1}^{n_2} \sum_{j'=1}^{n_2} F_0(X_{n+1,j}^{(2)}) F_0(X_{n+1,j'}^{(2)})$ (apart from a multiplicative constant).
To evaluate its expectation use (3.25) of Lemma 3.3. Equation (3.8)
is similarly proved by using Lemma 3.3. To simplify $E\Delta^{*2}$ and $E\hat\Delta_n^2$,
use Remark 3.4. The asymptotic optimality of $D = \{\hat\Delta_n\}$ follows from
the fact that $\lim_{n \to \infty} E\hat\Delta_n^2 = E\Delta^{*2}$, which follows by letting $n \to \infty$ in (3.8).
Note that, from (3.11), it is seen that the rate at which $R_n(D, \alpha_1, \alpha_2)$
converges to the minimum Bayes risk is 1/n.